First of all, the objective of this case study is to show the advantages of Bayesian statistics for small data sets and its ability to better estimate the posterior parameters. As is well known, Bayesian statistics allows us to encode prior beliefs about our predictors through specific probability distributions. This is especially useful when we do not have much data but we do have some insight into it. For this reason I have chosen a data set with 21 variables, and I will reduce the number of observations to simulate this setting. The data concerns COVID cases in Mexico, and the goal is to predict whether a patient has COVID or not.
[[https://www.kaggle.com/datasets/meirnizri/covid19-dataset][Dataset]]
rm(list = ls())
data = read.csv("data.csv", header = TRUE)
dim(data)
## [1] 1048575 21
summary(data)
## USMER MEDICAL_UNIT SEX PATIENT_TYPE
## Min. :1.000 Min. : 1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.: 4.000 1st Qu.:1.000 1st Qu.:1.000
## Median :2.000 Median :12.000 Median :1.000 Median :1.000
## Mean :1.632 Mean : 8.981 Mean :1.499 Mean :1.191
## 3rd Qu.:2.000 3rd Qu.:12.000 3rd Qu.:2.000 3rd Qu.:1.000
## Max. :2.000 Max. :13.000 Max. :2.000 Max. :2.000
## DATE_DIED INTUBED PNEUMONIA AGE
## Length:1048575 Min. : 1.00 Min. : 1.000 Min. : 0.00
## Class :character 1st Qu.:97.00 1st Qu.: 2.000 1st Qu.: 30.00
## Mode :character Median :97.00 Median : 2.000 Median : 40.00
## Mean :79.52 Mean : 3.347 Mean : 41.79
## 3rd Qu.:97.00 3rd Qu.: 2.000 3rd Qu.: 53.00
## Max. :99.00 Max. :99.000 Max. :121.00
## PREGNANT DIABETES COPD ASTHMA
## Min. : 1.00 Min. : 1.000 Min. : 1.000 Min. : 1.000
## 1st Qu.: 2.00 1st Qu.: 2.000 1st Qu.: 2.000 1st Qu.: 2.000
## Median :97.00 Median : 2.000 Median : 2.000 Median : 2.000
## Mean :49.77 Mean : 2.186 Mean : 2.261 Mean : 2.243
## 3rd Qu.:97.00 3rd Qu.: 2.000 3rd Qu.: 2.000 3rd Qu.: 2.000
## Max. :98.00 Max. :98.000 Max. :98.000 Max. :98.000
## INMSUPR HIPERTENSION OTHER_DISEASE CARDIOVASCULAR
## Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 1.000
## 1st Qu.: 2.000 1st Qu.: 2.000 1st Qu.: 2.000 1st Qu.: 2.000
## Median : 2.000 Median : 2.000 Median : 2.000 Median : 2.000
## Mean : 2.298 Mean : 2.129 Mean : 2.435 Mean : 2.262
## 3rd Qu.: 2.000 3rd Qu.: 2.000 3rd Qu.: 2.000 3rd Qu.: 2.000
## Max. :98.000 Max. :98.000 Max. :98.000 Max. :98.000
## OBESITY RENAL_CHRONIC TOBACCO CLASIFFICATION_FINAL
## Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. :1.000
## 1st Qu.: 2.000 1st Qu.: 2.000 1st Qu.: 2.000 1st Qu.:3.000
## Median : 2.000 Median : 2.000 Median : 2.000 Median :6.000
## Mean : 2.125 Mean : 2.257 Mean : 2.214 Mean :5.306
## 3rd Qu.: 2.000 3rd Qu.: 2.000 3rd Qu.: 2.000 3rd Qu.:7.000
## Max. :98.000 Max. :98.000 Max. :98.000 Max. :7.000
## ICU
## Min. : 1.00
## 1st Qu.:97.00
## Median :97.00
## Mean :79.55
## 3rd Qu.:97.00
## Max. :99.00
The raw data set consists of 21 features and 1,048,575 unique patients. In the Boolean features, 1 means “yes” and 2 means “no”; values such as 97, 98, and 99 encode missing data.
Above is a summary of the data; before working with it, we first have to clean and adapt it. To start, I will create the variable that we want to predict: whether a patient has been diagnosed with COVID or not.
data$COVID = ifelse(data$CLASIFFICATION_FINAL <= 3, 1, 2)
data = subset(data, select = -c(CLASIFFICATION_FINAL))
convertToLogic = function(col.name, df) {
index = which(names(df) == col.name)
print(index)
if (length(index) != 0) {
df[, index] = ifelse(df[, index] == 2, 0, df[, index])
df[, index] = as.logical(df[, index])
}
return(df)
}
This column tells us whether a patient has been diagnosed with COVID or not. Next, I will factor and format all the other variables so they are properly encoded.
data = convertToLogic("COVID", data)
## [1] 21
data$USMER = ifelse(data$USMER == 2, 0, data$USMER)
data$USMER = as.logical(data$USMER)
data$MEDICAL_UNIT = factor(data$MEDICAL_UNIT)
data$SEX = factor(data$SEX, labels = c("female", "male"), levels = c(1, 2))
data$PATIENT_TYPE = factor(data$PATIENT_TYPE, labels = c("returned home", "hospitalized"), levels = c(1, 2))
data$INTUBED = factor(data$INTUBED, labels = c("intubed", "not intubed"), levels = c(1, 2))
data$PNEUMONIA = factor(data$PNEUMONIA, labels = c("pneumonia", "not pneumonia"), levels = c(1, 2))
data$PREGNANT = factor(data$PREGNANT, labels = c("pregnant", "not pregnant"), levels = c(1, 2))
data$DIABETES = factor(data$DIABETES, labels = c("diabetes", "not diabetes"), levels = c(1, 2))
data$COPD = factor(data$COPD, labels = c("copd", "not copd"), levels = c(1, 2))
data$ASTHMA = factor(data$ASTHMA, labels = c("asthma", "not asthma"), levels = c(1, 2))
data$INMSUPR = factor(data$INMSUPR, labels = c("inmsupr", "not inmsupr"), levels = c(1, 2))
data$HIPERTENSION = factor(data$HIPERTENSION, labels = c("hipertension", "not hipertension"), levels = c(1, 2))
data$OTHER_DISEASE = factor(data$OTHER_DISEASE, labels = c("other desease", "not other desease"), levels = c(1, 2))
data$CARDIOVASCULAR = factor(data$CARDIOVASCULAR, labels = c("cardiovascular", "not cardiovascular"), levels = c(1, 2))
data$OBESITY = factor(data$OBESITY, labels = c("obesity", "not obesity"), levels = c(1, 2))
data$RENAL_CHRONIC = factor(data$RENAL_CHRONIC, labels = c("renal chronic", "not renal chronic"), levels = c(1, 2))
data$TOBACCO = factor(data$TOBACCO, labels = c("tobacco", "not tobacco"), levels = c(1, 2))
data$ICU = factor(data$ICU, labels = c("icu", "not icu"), levels = c(1, 2))
data = subset(data, select = -c(DATE_DIED))
summary(data)
## USMER MEDICAL_UNIT SEX PATIENT_TYPE
## Mode :logical 12 :602995 female:525064 returned home:848544
## FALSE:662903 4 :314405 male :523511 hospitalized :200031
## TRUE :385672 6 : 40584
## 9 : 38116
## 3 : 19175
## 8 : 10399
## (Other): 22901
## INTUBED PNEUMONIA AGE
## intubed : 33656 pneumonia :140038 Min. : 0.00
## not intubed:159050 not pneumonia:892534 1st Qu.: 30.00
## NA's :855869 NA's : 16003 Median : 40.00
## Mean : 41.79
## 3rd Qu.: 53.00
## Max. :121.00
##
## PREGNANT DIABETES COPD
## pregnant : 8131 diabetes :124989 copd : 15062
## not pregnant:513179 not diabetes:920248 not copd:1030510
## NA's :527265 NA's : 3338 NA's : 3003
##
##
##
##
## ASTHMA INMSUPR HIPERTENSION
## asthma : 31572 inmsupr : 14170 hipertension :162729
## not asthma:1014024 not inmsupr:1031001 not hipertension:882742
## NA's : 2979 NA's : 3404 NA's : 3104
##
##
##
##
## OTHER_DISEASE CARDIOVASCULAR OBESITY
## other desease : 28040 cardiovascular : 20769 obesity :159816
## not other desease:1015490 not cardiovascular:1024730 not obesity:885727
## NA's : 5045 NA's : 3076 NA's : 3032
##
##
##
##
## RENAL_CHRONIC TOBACCO ICU
## renal chronic : 18904 tobacco : 84376 icu : 16858
## not renal chronic:1026665 not tobacco:960979 not icu:175685
## NA's : 3006 NA's : 3220 NA's :856032
##
##
##
##
## COVID
## Mode :logical
## FALSE:656596
## TRUE :391979
##
##
##
##
Here we see that the data is correctly formatted, but there are some missing values, so let’s fix that.
First we will look at how many missing values there are by row, so we can then remove the columns that have too many.
print(length(which(is.na(data))))
## [1] 2288376
hist(rowMeans(is.na(data)), xlab = c("Missing values average by rows"), main = c())
The histogram suggests that the missing values are concentrated in a few columns, so let’s identify them.
indexesEmptyCols = which(colMeans(is.na(data)) != 0)
colsWithNA = sort(colMeans(is.na(data[, indexesEmptyCols])),
decreasing = TRUE)
barplot(colsWithNA, las=2)
And the columns that have the most missing values are ICU, INTUBED, and PREGNANT, so let’s remove them.
data = subset(data, select = -c(ICU, INTUBED, PREGNANT))
print(length(which(is.na(data))))
## [1] 49210
data = na.omit(data)
length(unique(which(is.na(data))))
## [1] 0
summary(data)
## USMER MEDICAL_UNIT SEX PATIENT_TYPE
## Mode :logical 12 :591811 female:513216 returned home:833253
## FALSE:658255 4 :307177 male :511936 hospitalized :191899
## TRUE :366897 6 : 37868
## 9 : 37384
## 3 : 18660
## 8 : 10097
## (Other): 22155
## PNEUMONIA AGE DIABETES
## pneumonia :137599 Min. : 0.00 diabetes :122415
## not pneumonia:887553 1st Qu.: 30.00 not diabetes:902737
## Median : 40.00
## Mean : 41.89
## 3rd Qu.: 53.00
## Max. :121.00
##
## COPD ASTHMA INMSUPR
## copd : 14376 asthma : 30497 inmsupr : 13588
## not copd:1010776 not asthma:994655 not inmsupr:1011564
##
##
##
##
##
## HIPERTENSION OTHER_DISEASE
## hipertension :159577 other desease : 27131
## not hipertension:865575 not other desease:998021
##
##
##
##
##
## CARDIOVASCULAR OBESITY RENAL_CHRONIC
## cardiovascular : 20126 obesity :156961 renal chronic : 18351
## not cardiovascular:1005026 not obesity:868191 not renal chronic:1006801
##
##
##
##
##
## TOBACCO COVID
## tobacco : 82675 Mode :logical
## not tobacco:942477 FALSE:636274
## TRUE :388878
##
##
##
##
Now that the data set is clean, let’s start working on it. First we will shrink it to 1,000 observations.
data.small = data[sample(nrow(data), size=1000),]
First of all, let’s plot a histogram of the COVID variable (the one we want to predict).
rm(list = setdiff(ls(), c("data", "data.small")))
hist(as.numeric(data$COVID))
This is as expected, since we are going to be predicting a binary variable. So let’s use a Bernoulli distribution to model this data and see how well it fits. First of all, let’s compute the analytical posterior distribution of the COVID variable.
\[X \ | \ \theta \thicksim Bernoulli(\theta)\] \[f(x \ | \ \theta) = \theta^x \cdot (1 - \theta)^{1 - x}\]
As prior we take the non-informative (improper) \[\theta \thicksim Beta(0, 0)\] \[f(\theta) \propto \theta^{0 - 1} \cdot (1 - \theta)^{0 - 1}\] Now we get the likelihood
\[f(data \ | \ \theta) \propto \theta^{k} \cdot (1 - \theta)^{n - k}\]
where n is the total number of observations and k the number of positive ones. Multiplying the prior by the likelihood and normalizing, we obtain the posterior:
\[f(\theta \ | \ data) = \frac{\theta^{k - 1} \cdot (1 - \theta)^{n - k - 1}}{B(k, n - k)} \]
\[\theta \ | \ data \thicksim Beta(k, n - k)\]
So now that we have the posterior distribution, let’s obtain the predictive probability of the next observation Y given the data:
\[Y \ | \ \theta \thicksim Bernoulli(\theta)\] \[P ( Y = 1 \ | \ data) = \int_{0}^{1}{P(Y=1 \ | \ \theta) \cdot f(\theta \ | \ data) \ d\theta} = \frac{B(k + 1, n - k)}{B(k, n - k)}\]
n = as.numeric(length(data.small$COVID))
k = as.numeric(length(which(data.small$COVID)))
print(beta(k+1,n-k)/beta(k, n-k))
## [1] 0.38
Here we see that the posterior predictive probability that a new patient has COVID is 0.38, which coincides with the maximum-likelihood estimate k/n = 0.38.
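This coincidence is not an accident: the Beta-function ratio simplifies to B(k + 1, n - k)/B(k, n - k) = k/n, so under the improper Beta(0, 0) prior the predictive probability equals the ML estimate. A quick numerical check (k and n are hypothetical here, since data.small is drawn at random):

```r
# B(k + 1, n - k) / B(k, n - k) = k / (k + (n - k)) = k / n,
# so the posterior predictive probability matches the ML estimate.
k <- 380; n <- 1000  # hypothetical counts for illustration
beta(k + 1, n - k) / beta(k, n - k)
k / n
```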
Finally, let’s obtain the same result numerically. Since we know the distribution of the new observation, we will draw a random sample and compare:
\[Y \ | \ data \thicksim Bernoulli(\theta), \quad \theta \thicksim Beta(k, n - k)\]
y.sample = rbinom(n, 1, rbeta(1, k, n - k))
mean(y.sample)
## [1] 0.36
Here we see that the simulated estimate is almost the same as the analytical one.
covid.prob = rbeta(n, k, n - k)
quantile(covid.prob, probs = c(0.025, 0.975))
## 2.5% 97.5%
## 0.3504863 0.4123275
We also see that the 95% credible interval for the probability of having COVID is fairly narrow, so we can be reasonably confident in the estimate.
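As a cross-check, the same interval is available in closed form from the quantile function of the Beta(k, n - k) posterior, with no Monte Carlo error; a sketch (k and n are hypothetical here, since data.small is random):

```r
# Exact 95% credible interval from the Beta posterior quantiles,
# rather than sampling from rbeta.
k <- 380; n <- 1000  # hypothetical counts for illustration
qbeta(c(0.025, 0.975), k, n - k)
```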
Now, we will see if the other variables are useful to predict if a patient has covid or not.
rm(list = setdiff(ls(), c("data", "data.small")))
library(ggplot2) # GGally
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
Let’s see how the columns are distributed with respect to the COVID variable.
ggplot(data, aes(x=COVID, y=AGE)) +
geom_boxplot()
Here we see a visible difference in the age distribution between the two COVID groups, so AGE may be a useful variable for our model.
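A quick numerical complement to the boxplot (a sketch; assumes the cleaned `data` frame prepared above):

```r
# Mean and median age within each COVID group.
tapply(data$AGE, data$COVID, mean)
tapply(data$AGE, data$COVID, median)
```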
plot(table(data$COVID, data$USMER))
plot(table(data$COVID, data$MEDICAL_UNIT))
plot(table(data$COVID, data$SEX))
plot(table(data$COVID, data$PATIENT_TYPE))
plot(table(data$COVID, data$PNEUMONIA))
plot(table(data$COVID, data$DIABETES))
plot(table(data$COVID, data$COPD))
plot(table(data$COVID, data$ASTHMA))
plot(table(data$COVID, data$INMSUPR))
plot(table(data$COVID, data$HIPERTENSION))
plot(table(data$COVID, data$OTHER_DISEASE))
plot(table(data$COVID, data$CARDIOVASCULAR))
plot(table(data$COVID, data$OBESITY))
plot(table(data$COVID, data$RENAL_CHRONIC))
plot(table(data$COVID, data$TOBACCO))
Now, the columns that stand out the most are MEDICAL_UNIT, SEX, PATIENT_TYPE, and PNEUMONIA. This makes sense, and later we will check whether we can be confident that the difference is real.
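Before fitting any model, the visual impression can be backed up with a classical chi-squared test of independence; a sketch assuming the cleaned `data` frame from above (note that with a sample this large even tiny effects come out significant, so the p-values are only a rough guide):

```r
# Chi-squared test of independence between COVID and each of the
# factors that stand out in the mosaic plots.
for (v in c("MEDICAL_UNIT", "SEX", "PATIENT_TYPE", "PNEUMONIA")) {
  p <- chisq.test(table(data$COVID, data[[v]]))$p.value
  cat(v, ": p-value =", p, "\n")
}
```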
Now let’s fit a simple logistic regression (GLM) to see how well we can predict whether a patient has COVID.
rm(list = setdiff(ls(), c("data")))
library(caret)
## Loading required package: lattice
library(lattice)
data.small = data[sample(nrow(data), size=10000),]
index.test = createDataPartition(data.small$COVID, p = 0.5, list = FALSE)
data.test = data.small[index.test,]
data.train = data.small[-index.test,]
rm(index.test)
First of all, let’s use all the variables to try to predict whether a patient has COVID or not.
fit = train(as.factor(COVID) ~ ., data = data.train, method = "glm", family = "binomial")
summary(fit)
##
## Call:
## NULL
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.647697 1.271962 -2.868 0.004134 **
## USMERTRUE 0.094284 0.065983 1.429 0.153031
## MEDICAL_UNIT2 -9.650933 196.971035 -0.049 0.960922
## MEDICAL_UNIT3 1.024084 1.170356 0.875 0.381563
## MEDICAL_UNIT4 1.164500 1.148457 1.014 0.310597
## MEDICAL_UNIT5 1.343421 1.195317 1.124 0.261054
## MEDICAL_UNIT6 1.016744 1.157167 0.879 0.379591
## MEDICAL_UNIT7 1.539165 1.567072 0.982 0.326005
## MEDICAL_UNIT8 1.175949 1.188158 0.990 0.322309
## MEDICAL_UNIT9 1.017897 1.158113 0.879 0.379441
## MEDICAL_UNIT10 1.568705 1.196400 1.311 0.189794
## MEDICAL_UNIT11 1.621831 1.205917 1.345 0.178659
## MEDICAL_UNIT12 1.018086 1.148563 0.886 0.375402
## MEDICAL_UNIT13 NA NA NA NA
## SEXmale 0.208295 0.061110 3.409 0.000653 ***
## PATIENT_TYPEhospitalized 0.379112 0.103549 3.661 0.000251 ***
## `PNEUMONIAnot pneumonia` -0.724558 0.111909 -6.475 9.51e-11 ***
## AGE 0.012094 0.002122 5.701 1.19e-08 ***
## `DIABETESnot diabetes` -0.036152 0.102216 -0.354 0.723578
## `COPDnot copd` 0.109181 0.270431 0.404 0.686413
## `ASTHMAnot asthma` -0.043505 0.174163 -0.250 0.802745
## `INMSUPRnot inmsupr` 0.333857 0.262762 1.271 0.203883
## `HIPERTENSIONnot hipertension` -0.137879 0.095191 -1.448 0.147492
## `OTHER_DISEASEnot other desease` 0.656741 0.206604 3.179 0.001479 **
## `CARDIOVASCULARnot cardiovascular` 0.777438 0.251359 3.093 0.001982 **
## `OBESITYnot obesity` -0.285248 0.083996 -3.396 0.000684 ***
## `RENAL_CHRONICnot renal chronic` 0.326682 0.226655 1.441 0.149495
## `TOBACCOnot tobacco` 0.260149 0.114280 2.276 0.022821 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6633.7 on 4999 degrees of freedom
## Residual deviance: 6291.0 on 4973 degrees of freedom
## AIC: 6345
##
## Number of Fisher Scoring iterations: 10
Here we see that many of the variables are not useful, since the p-values of their coefficients are very high.
confusionMatrix(as.factor(data.test$COVID), predict(fit, newdata = data.test))
## Confusion Matrix and Statistics
##
## Reference
## Prediction FALSE TRUE
## FALSE 2803 304
## TRUE 1411 482
##
## Accuracy : 0.657
## 95% CI : (0.6437, 0.6702)
## No Information Rate : 0.8428
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.177
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.6652
## Specificity : 0.6132
## Pos Pred Value : 0.9022
## Neg Pred Value : 0.2546
## Prevalence : 0.8428
## Detection Rate : 0.5606
## Detection Prevalence : 0.6214
## Balanced Accuracy : 0.6392
##
## 'Positive' Class : FALSE
##
Here we get an accuracy of 0.657, which is not terrible, although note that it is below the no-information rate of 0.84 caused by the class imbalance. Probably only a few variables are truly informative, as we saw in the plots above, so let’s try a simpler model.
fit = train(as.factor(COVID) ~ USMER + PNEUMONIA + MEDICAL_UNIT + DIABETES + HIPERTENSION + AGE + PATIENT_TYPE, data = data.train, method = "glm", family = "binomial")
summary(fit)
##
## Call:
## NULL
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.377997 1.166096 -1.182 0.237317
## USMERTRUE 0.082135 0.065546 1.253 0.210174
## `PNEUMONIAnot pneumonia` -0.732420 0.110769 -6.612 3.79e-11 ***
## MEDICAL_UNIT2 -9.775325 196.971007 -0.050 0.960419
## MEDICAL_UNIT3 0.965563 1.165979 0.828 0.407606
## MEDICAL_UNIT4 1.128609 1.144594 0.986 0.324116
## MEDICAL_UNIT5 1.317184 1.190936 1.106 0.268723
## MEDICAL_UNIT6 0.994307 1.153313 0.862 0.388616
## MEDICAL_UNIT7 1.540980 1.542823 0.999 0.317889
## MEDICAL_UNIT8 1.274316 1.183086 1.077 0.281430
## MEDICAL_UNIT9 0.993985 1.154172 0.861 0.389122
## MEDICAL_UNIT10 1.599234 1.191857 1.342 0.179661
## MEDICAL_UNIT11 1.687446 1.201937 1.404 0.160337
## MEDICAL_UNIT12 1.003887 1.144657 0.877 0.380476
## MEDICAL_UNIT13 NA NA NA NA
## `DIABETESnot diabetes` -0.019602 0.100831 -0.194 0.845859
## `HIPERTENSIONnot hipertension` -0.132394 0.091982 -1.439 0.150051
## AGE 0.011505 0.002089 5.507 3.65e-08 ***
## PATIENT_TYPEhospitalized 0.336352 0.101510 3.313 0.000921 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6633.7 on 4999 degrees of freedom
## Residual deviance: 6344.4 on 4982 degrees of freedom
## AIC: 6380.4
##
## Number of Fisher Scoring iterations: 10
The model is simpler now, but MEDICAL_UNIT remains problematic: most of its levels are still not individually significant.
confusionMatrix(as.factor(data.test$COVID), predict(fit, newdata = data.test))
## Confusion Matrix and Statistics
##
## Reference
## Prediction FALSE TRUE
## FALSE 2793 314
## TRUE 1432 461
##
## Accuracy : 0.6508
## 95% CI : (0.6374, 0.664)
## No Information Rate : 0.845
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.161
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.6611
## Specificity : 0.5948
## Pos Pred Value : 0.8989
## Neg Pred Value : 0.2435
## Prevalence : 0.8450
## Detection Rate : 0.5586
## Detection Prevalence : 0.6214
## Balanced Accuracy : 0.6280
##
## 'Positive' Class : FALSE
##
Here we see that the accuracy and the kappa are almost the same, so we have not lost much information.
The frequentist approach is easier, but it does not give us full probability distributions for the parameters or for predictions. That is why we will use the Bayesian approach to study the effect of each variable on COVID more thoroughly and draw richer conclusions. The power of the Bayesian approach is that we obtain the posterior distribution of the parameters, so we can better assess their relationship with the response and their significance. So let’s start.
library(coda)
library(MASS)
library(MCMCpack)
rm(list = setdiff(ls(), c("data", "data.small", "data.test", "data.train")))
fit = MCMClogit(COVID ~ ., data = data.train, burnin=1000, mcmc=210000)
par(mar=c(1, 1, 1, 1))
plot(fit)
summary(fit)
##
## Iterations = 1001:211000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 210000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## (Intercept) -376.25257 2.097e+02 4.577e-01 12.125949
## USMERTRUE 0.08932 6.701e-02 1.462e-04 0.003578
## MEDICAL_UNIT3 373.57969 2.097e+02 4.577e-01 12.128977
## MEDICAL_UNIT4 373.71923 2.097e+02 4.577e-01 12.129152
## MEDICAL_UNIT5 373.89959 2.098e+02 4.577e-01 12.129814
## MEDICAL_UNIT6 373.57257 2.098e+02 4.577e-01 12.129274
## MEDICAL_UNIT7 374.08585 2.097e+02 4.576e-01 12.123135
## MEDICAL_UNIT8 373.71724 2.098e+02 4.578e-01 12.132534
## MEDICAL_UNIT9 373.57129 2.097e+02 4.577e-01 12.128097
## MEDICAL_UNIT10 374.10892 2.098e+02 4.578e-01 12.130807
## MEDICAL_UNIT11 374.17674 2.097e+02 4.577e-01 12.127994
## MEDICAL_UNIT12 373.57465 2.097e+02 4.577e-01 12.129264
## MEDICAL_UNIT13 372.12884 2.098e+02 4.578e-01 12.135588
## SEXmale 0.20522 6.008e-02 1.311e-04 0.003188
## PATIENT_TYPEhospitalized 0.39305 1.037e-01 2.262e-04 0.005817
## PNEUMONIAnot pneumonia -0.72429 1.056e-01 2.304e-04 0.005255
## AGE 0.01228 2.068e-03 4.513e-06 0.000108
## DIABETESnot diabetes -0.03273 1.008e-01 2.199e-04 0.005151
## COPDnot copd 0.13334 2.682e-01 5.852e-04 0.014348
## ASTHMAnot asthma -0.05817 1.739e-01 3.795e-04 0.009339
## INMSUPRnot inmsupr 0.36506 2.575e-01 5.619e-04 0.012960
## HIPERTENSIONnot hipertension -0.12582 9.500e-02 2.073e-04 0.005312
## OTHER_DISEASEnot other desease 0.68088 2.039e-01 4.449e-04 0.011037
## CARDIOVASCULARnot cardiovascular 0.78124 2.490e-01 5.433e-04 0.012709
## OBESITYnot obesity -0.28849 8.122e-02 1.772e-04 0.004159
## RENAL_CHRONICnot renal chronic 0.30085 2.270e-01 4.954e-04 0.012171
## TOBACCOnot tobacco 0.24477 1.182e-01 2.579e-04 0.006579
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75%
## (Intercept) -6.989e+02 -553.95676 -397.00642 -203.67551
## USMERTRUE -3.368e-02 0.04431 0.08627 0.13049
## MEDICAL_UNIT3 1.068e+01 200.71615 394.02809 551.60896
## MEDICAL_UNIT4 1.082e+01 200.97653 394.32868 551.57777
## MEDICAL_UNIT5 1.099e+01 201.19860 394.67356 551.62641
## MEDICAL_UNIT6 1.067e+01 200.82866 394.19995 551.33743
## MEDICAL_UNIT7 1.119e+01 201.96328 393.72363 551.66027
## MEDICAL_UNIT8 1.083e+01 200.91157 394.29261 551.70893
## MEDICAL_UNIT9 1.067e+01 200.91928 394.10903 551.34237
## MEDICAL_UNIT10 1.122e+01 201.73805 394.22645 551.86116
## MEDICAL_UNIT11 1.127e+01 201.95237 395.25460 551.53600
## MEDICAL_UNIT12 1.067e+01 200.80387 394.18120 551.41278
## MEDICAL_UNIT13 9.651e+00 198.92662 393.91489 551.86470
## SEXmale 8.722e-02 0.16646 0.20728 0.24744
## PATIENT_TYPEhospitalized 1.831e-01 0.32095 0.39396 0.46260
## PNEUMONIAnot pneumonia -9.458e-01 -0.78969 -0.73216 -0.65065
## AGE 8.157e-03 0.01083 0.01223 0.01357
## DIABETESnot diabetes -2.453e-01 -0.09810 -0.03524 0.02929
## COPDnot copd -3.799e-01 -0.03879 0.12502 0.31735
## ASTHMAnot asthma -4.142e-01 -0.16743 -0.06236 0.05590
## INMSUPRnot inmsupr -1.223e-01 0.20092 0.34731 0.54010
## HIPERTENSIONnot hipertension -3.147e-01 -0.18901 -0.12420 -0.05887
## OTHER_DISEASEnot other desease 2.751e-01 0.54139 0.67796 0.80816
## CARDIOVASCULARnot cardiovascular 3.224e-01 0.59946 0.77744 0.93374
## OBESITYnot obesity -4.472e-01 -0.33835 -0.28525 -0.23332
## RENAL_CHRONICnot renal chronic -1.158e-01 0.15806 0.31769 0.42816
## TOBACCOnot tobacco 2.816e-02 0.16188 0.24535 0.32309
## 97.5%
## (Intercept) -13.29863
## USMERTRUE 0.22965
## MEDICAL_UNIT3 695.83160
## MEDICAL_UNIT4 696.46238
## MEDICAL_UNIT5 696.62088
## MEDICAL_UNIT6 696.60999
## MEDICAL_UNIT7 696.83407
## MEDICAL_UNIT8 696.28339
## MEDICAL_UNIT9 696.12496
## MEDICAL_UNIT10 697.04661
## MEDICAL_UNIT11 696.39288
## MEDICAL_UNIT12 696.31882
## MEDICAL_UNIT13 695.16664
## SEXmale 0.32226
## PATIENT_TYPEhospitalized 0.58958
## PNEUMONIAnot pneumonia -0.51199
## AGE 0.01607
## DIABETESnot diabetes 0.16451
## COPDnot copd 0.65960
## ASTHMAnot asthma 0.28862
## INMSUPRnot inmsupr 0.85074
## HIPERTENSIONnot hipertension 0.05357
## OTHER_DISEASEnot other desease 1.09955
## CARDIOVASCULARnot cardiovascular 1.29783
## OBESITYnot obesity -0.13603
## RENAL_CHRONICnot renal chronic 0.78549
## TOBACCOnot tobacco 0.46551
From the Bayesian point of view, the 95% credible intervals for SEX, PATIENT_TYPE, PNEUMONIA, AGE, OTHER_DISEASE, CARDIOVASCULAR, OBESITY, and TOBACCO do not contain 0, so those predictors appear relevant. By contrast, the intervals for USMER, DIABETES, COPD, ASTHMA, INMSUPR, HIPERTENSION, and RENAL_CHRONIC do contain 0. Note also the extreme values and large time-series standard errors for the intercept and the MEDICAL_UNIT coefficients, which suggest poor mixing for those parameters.
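Since MCMClogit returns a coda `mcmc` object, we can also compute highest-posterior-density intervals instead of the equal-tailed quantiles shown in the summary; a sketch (assumes `fit` is the chain produced above):

```r
# HPD intervals are the shortest intervals containing 95% of the
# posterior mass; for skewed posteriors they can differ noticeably
# from the 2.5%/97.5% quantiles.
library(coda)
HPDinterval(fit, prob = 0.95)
```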
rm(list = setdiff(ls(), c("data", "data.small", "data.test", "data.train")))
library(monomvn)
x = data.frame(lapply(subset(data.train, select = -c(COVID)), function(x) as.numeric((x))))
adaptTo0And1 = function(col.name, df) {
index = which(names(df) == col.name)
if (length(index) != 0) {
df[, index] = ifelse(df[, index] == 2, 0, df[, index])
}
return(df)
}
x = adaptTo0And1("SEX", x)
x = adaptTo0And1("PATIENT_TYPE", x)
x = adaptTo0And1("PNEUMONIA", x)
x = adaptTo0And1("DIABETES", x)
x = adaptTo0And1("COPD", x)
x = adaptTo0And1("ASTHMA", x)
x = adaptTo0And1("INMSUPR", x)
x = adaptTo0And1("HIPERTENSION", x)
x = adaptTo0And1("OTHER_DISEASE", x)
x = adaptTo0And1("CARDIOVASCULAR", x)
x = adaptTo0And1("OBESITY", x)
x = adaptTo0And1("RENAL_CHRONIC", x)
x = adaptTo0And1("TOBACCO", x)
summary(x)
## USMER MEDICAL_UNIT SEX PATIENT_TYPE
## Min. :0.0000 Min. : 2.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.: 4.000 1st Qu.:0.0000 1st Qu.:1.0000
## Median :0.0000 Median :12.000 Median :1.0000 Median :1.0000
## Mean :0.3474 Mean : 8.991 Mean :0.5066 Mean :0.8098
## 3rd Qu.:1.0000 3rd Qu.:12.000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :13.000 Max. :1.0000 Max. :1.0000
## PNEUMONIA AGE DIABETES COPD
## Min. :0.0000 Min. : 0.00 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:30.00 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :41.00 Median :0.0000 Median :0.0000
## Mean :0.1392 Mean :41.79 Mean :0.1216 Mean :0.0134
## 3rd Qu.:0.0000 3rd Qu.:52.00 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :99.00 Max. :1.0000 Max. :1.0000
## ASTHMA INMSUPR HIPERTENSION OTHER_DISEASE
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.0314 Mean :0.0148 Mean :0.1552 Mean :0.0258
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## CARDIOVASCULAR OBESITY RENAL_CHRONIC TOBACCO
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.0178 Mean :0.1578 Mean :0.0198 Mean :0.0816
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
y = data.train$COVID
fit = blasso(x, y, mprior = c(0,1))
## t=100, m=14
## t=200, m=9
## t=300, m=12
## t=400, m=10
## t=500, m=12
## t=600, m=13
## t=700, m=12
## t=800, m=10
## t=900, m=9
After training the model, let’s check for stability.
plot(fit, burnin=200, which="m")
acf(fit$m)
Here we see some autocorrelation, and the chain is not stable enough; to be safe, let’s draw many more samples and apply some thinning.
set.seed(111)
fit = blasso(x, y, mprior = c(0,1), T = 10000, thin = 20)
## t=100, m=9
## t=200, m=13
## t=300, m=11
## [... progress output for t=400 through t=9800 omitted; m fluctuates between 7 and 14 ...]
## t=9900, m=14
plot(fit, burnin=1000, which="m")
acf(fit$m)
Now we do not see any periodicity, and the chain is much more stable.
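A numerical complement to the ACF plot is the effective sample size of the model-size trace; a sketch (assumes `fit` is the blasso fit from above):

```r
# Effective sample size: how many independent draws the autocorrelated
# chain for m is worth. Values close to the chain length indicate
# good mixing.
library(coda)
effectiveSize(as.mcmc(fit$m))
```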
plot(fit, burnin=1000, which="s2")
plot(fit, burnin=1000, which="lambda2")
We also see that s2 and lambda2 are stable, so we can trust that the sampler has converged. Now let’s look at the most important variables.
plot(fit, burnin=1000)
points(drop(fit$b), col=2, pch=20)
points(drop(fit$b), col=3, pch=18)
legend("topleft", c("blasso-map", "lasso", "lsr"),
col=c(2,2,3), pch=c(21,20,18))
s <- summary(fit, burnin=1000)
print(s$bn0) # probability that each beta coef != zero
## b.1 b.2 b.3 b.4 b.5 b.6 b.7 b.8
## 0.3692222 0.6027778 0.9753333 0.9996667 1.0000000 1.0000000 0.2935556 0.2718889
## b.9 b.10 b.11 b.12 b.13 b.14 b.15 b.16
## 0.2511111 0.4492222 0.4927778 0.9616667 0.9592222 0.9837778 0.4012222 0.7170000
barplot(s$bn0, horiz = TRUE)
Here we see that the most important variables are SEX, PATIENT_TYPE, PNEUMONIA, AGE, and CARDIOVASCULAR. All in all, we can conclude that these are the variables that matter most for predicting whether a patient has COVID.
Finally, we will build a Bayesian model with the variables selected by the lasso.
rm(list = setdiff(ls(), c("data", "data.small", "data.test", "data.train")))
library(R2OpenBUGS)
logit.bayes <- function(){
for( i in 1 : n ) {
COVID.bin[i] ~ dbern(p[i])
logit(p[i]) <- b0 + b1 * AGE[i] + b2*SEX.male[i] + b3*PATITENT_TYPE.hospitalized[i] + b4*PNEUMONIA.yes[i] + b5*CARDIOVASCULAR.yes[i]
}
b0 ~ dnorm(0.0, 1.0E-6)
b1 ~ dnorm(0.0, 1.0E-6)
b2 ~ dnorm(0.0, 1.0E-6)
b3 ~ dnorm(0.0, 1.0E-6)
b4 ~ dnorm(0.0, 1.0E-6)
b5 ~ dnorm(0.0, 1.0E-6)
}
COVID.bin=ifelse(data.train$COVID,1,0)
n=length(COVID.bin)
SEX.male = ifelse(data.train$SEX=="male",1,0)
PATITENT_TYPE.hospitalized = ifelse(data.train$PATIENT_TYPE=="hospitalized",1,0)
PNEUMONIA.yes = ifelse(data.train$PNEUMONIA=="pneumonia",1,0)
CARDIOVASCULAR.yes = ifelse(data.train$CARDIOVASCULAR=="cardiovascular",1,0)
data <- list(n=n, COVID.bin=COVID.bin, AGE=data.train$AGE, SEX.male = SEX.male, PATITENT_TYPE.hospitalized = PATITENT_TYPE.hospitalized, PNEUMONIA.yes=PNEUMONIA.yes, CARDIOVASCULAR.yes = CARDIOVASCULAR.yes)
inits <- function(){
list(b0 = 1, b1 = 0, b2 = 0, b3 = 0, b4 = 0, b5 = 0)
}
output <- bugs(data = data, inits = inits, parameters.to.save = c("b0", "b1", "b2", "b3", "b4", "b5"), model.file = logit.bayes, n.chains = 1, n.burnin=100, n.iter = 1000)
output
## Inference for Bugs model at "/tmp/RtmpgctoDl/model1577061a476ce.txt",
## Current: 1 chains, each with 1000 iterations (first 100 discarded)
## Cumulative: n.sims = 900 iterations saved
## mean sd 2.5% 25% 50% 75% 97.5%
## b0 -1.3 0.1 -1.5 -1.4 -1.3 -1.3 -1.1
## b1 0.0 0.0 0.0 0.0 0.0 0.0 0.0
## b2 0.2 0.1 0.1 0.2 0.2 0.2 0.3
## b3 0.4 0.1 0.2 0.3 0.4 0.5 0.6
## b4 0.7 0.1 0.5 0.7 0.7 0.8 0.9
## b5 -0.8 0.2 -1.3 -1.0 -0.8 -0.6 -0.3
## deviance 6345.8 3.6 6341.0 6343.0 6345.0 6348.0 6354.5
##
## DIC info (using the rule, pD = Dbar-Dhat)
## pD = 6.1 and DIC = 6352.0
## DIC is an estimate of expected predictive error (lower deviance is better).
b0.post <-output$sims.list$b0
b1.post <-output$sims.list$b1
b2.post <-output$sims.list$b2
b3.post <-output$sims.list$b3
b4.post <-output$sims.list$b4
b5.post <-output$sims.list$b5
ts.plot(b0.post)
acf(b0.post)
ts.plot(b1.post)
acf(b1.post)
ts.plot(b2.post)
acf(b2.post)
ts.plot(b3.post)
acf(b3.post)
ts.plot(b4.post)
acf(b4.post)
ts.plot(b5.post)
acf(b5.post)
We see that the chains are stable, so let’s compute some predictive probabilities. First, let’s establish the baseline.
linear = b0.post
pred.baseline = exp(linear)/(1+exp(linear))
mean(pred.baseline)
## [1] 0.209602
quantile(pred.baseline,c(0.025,0.975))
## 2.5% 97.5%
## 0.1801214 0.2427063
Here we see that a baseline patient (age 0, female, not hospitalized, no pneumonia, no cardiovascular disease) has a probability of about 0.21 of having COVID. Now let’s compare this with other groups.
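A small aside: the exp(linear)/(1 + exp(linear)) transform used throughout is the inverse logit, which R provides as `plogis`; for example, the baseline computation above could be written as:

```r
# plogis(x) computes exp(x) / (1 + exp(x)), the inverse logit.
# Assumes b0.post, the posterior draws of the intercept, from above.
pred.baseline <- plogis(b0.post)
mean(pred.baseline)
quantile(pred.baseline, c(0.025, 0.975))
```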
20 years old
linear = b0.post+b1.post * 20
pred.prob = exp(linear)/(1+exp(linear))
mean(pred.prob)
## [1] 0.2566922
quantile(pred.prob,c(0.025,0.975))
## 2.5% 97.5%
## 0.2346928 0.2799761
50 years old
linear = b0.post+b1.post * 50
pred.prob = exp(linear)/(1+exp(linear))
mean(pred.prob)
## [1] 0.3395505
quantile(pred.prob,c(0.025,0.975))
## 2.5% 97.5%
## 0.3173965 0.3606602
80 years old
linear = b0.post+b1.post * 80
pred.prob = exp(linear)/(1+exp(linear))
mean(pred.prob)
## [1] 0.4337181
quantile(pred.prob,c(0.025,0.975))
## 2.5% 97.5%
## 0.3900243 0.4776586
Here we can clearly see that the older the patient, the higher the probability of having COVID.
Female
linear = b0.post
pred.prob = exp(linear)/(1+exp(linear))
mean(pred.prob)
## [1] 0.209602
quantile(pred.prob,c(0.025,0.975))
## 2.5% 97.5%
## 0.1801214 0.2427063
Male
linear = b0.post+b2.post * 1
pred.prob = exp(linear)/(1+exp(linear))
mean(pred.prob)
## [1] 0.2430516
quantile(pred.prob,c(0.025,0.975))
## 2.5% 97.5%
## 0.2119389 0.2799126
We see some indication that males have a higher probability of having COVID, but the credible intervals overlap, so the evidence is not conclusive.
Not hospitalized
linear = b0.post
pred.prob = exp(linear)/(1+exp(linear))
mean(pred.prob)
## [1] 0.209602
quantile(pred.prob,c(0.025,0.975))
## 2.5% 97.5%
## 0.1801214 0.2427063
Hospitalized
linear = b0.post+b3.post * 1
pred.prob = exp(linear)/(1+exp(linear))
mean(pred.prob)
## [1] 0.2867728
quantile(pred.prob,c(0.025,0.975))
## 2.5% 97.5%
## 0.2333096 0.3457791
Now, this effect is clearly significant, and it makes sense: if a patient has been hospitalized, it is quite likely that he or she has COVID.
Not pneumonia
linear = b0.post
pred.prob = exp(linear)/(1+exp(linear))
mean(pred.prob)
## [1] 0.209602
quantile(pred.prob,c(0.025,0.975))
## 2.5% 97.5%
## 0.1801214 0.2427063
Pneumonia
linear = b0.post+b4.post * 1
pred.prob = exp(linear)/(1+exp(linear))
mean(pred.prob)
## [1] 0.3547374
quantile(pred.prob,c(0.025,0.975))
## 2.5% 97.5%
## 0.2958420 0.4179099
The same conclusion holds for pneumonia.
Not cardiovascular
linear = b0.post
pred.prob = exp(linear)/(1+exp(linear))
mean(pred.prob)
## [1] 0.209602
quantile(pred.prob,c(0.025,0.975))
## 2.5% 97.5%
## 0.1801214 0.2427063
Cardiovascular
linear = b0.post+b5.post * 1
pred.prob = exp(linear)/(1+exp(linear))
mean(pred.prob)
## [1] 0.1104778
quantile(pred.prob,c(0.025,0.975))
## 2.5% 97.5%
## 0.06538988 0.17185446
For cardiovascular disease, however, the model predicts a lower probability of COVID, which is counter-intuitive, so we will be cautious and not draw conclusions from it.
We have seen that many variables in this data set are not useful for predicting whether a patient has COVID, but with the Bayesian approach and the lasso we have found some that are significant (in this case: age, patient type, and pneumonia), which makes sense. The power of the Bayesian approach is that we obtain the full posterior distribution, so we can judge how significant each variable is and how much we trust it. For example, sex and cardiovascular disease had wide or counter-intuitive credible intervals, so we discarded them. This shows the usefulness of the Bayesian approach compared with the frequentist one.